R for data science (2 marks total)
Wordle is a web-based word game created and developed by Welsh software
engineer Josh Wardle.
It has become incredibly popular, thanks to its simplicity, playability
and challenge… and also because players around the world can show off
their prowess by posting their daily results on social media as emoji
squares.
The aim of the game is to guess a five-letter word using a maximum of
six attempts.
There is a unique “wordle” every day and the rules are simple.
After every guess you receive feedback for each letter in your guess in
the form of a coloured square:
⬛ means that the letter is not in the word
🟨 means the letter is in the word but in a different position to the
one you chose.
🟩 means the letter is in the word in the position you chose.
This assessment asks you to investigate the results posted by a sample of Australian and international players.
…your data awaits…
rm(list=ls()) # Remove all objects
This assessment can and should be completed using only the following four libraries.
library('ggplot2')
library('dplyr')
library('lubridate')
library('plotly')
This assessment uses two datasets of tweet data from wordle results which we load here:
wordle.int <- readRDS('../data/wordle.int.RDS') # A sample of 20000 wordle results from international players
Warning in readRDS("../data/wordle.int.RDS"): strings not representable in native encoding will be translated to UTF-8
wordle.aus <- readRDS('../data/wordle.aus.RDS') # A sample of 11389 wordle results from Australian players
Both wordle.int and wordle.aus contain the variables:

* wordle_id: a unique consecutive identifier of the wordle
* solution: the solution or wordle of the day, e.g. POWER
* date: date of the published result by the user on Twitter
* attempts: the number of attempts (out of 6) used to solve the wordle
* score: the score achieved, where each ⬛ = 2, 🟨 = 1 and 🟩 = 0. The smaller the better.
* vowels: number of vowels in the solution. It affects the difficulty of the guess. The Y is considered a vowel here.
* repeated: consecutive / double letters in the solution, e.g. SWEET. It affects the difficulty of the guess.
* freq: the log of the n-gram mean frequencies of the word (during the last 5 years)
* wday: day of the week (Mon, Tue, …, Sun)
* tweet: a spoiler-free emoji grid published by the user

The Australian dataset, wordle.aus, also contains the variable:

* city: the city of the Twitter user

Rows: 20,000
Columns: 11
Rowwise:
$ wordle_id <dbl> 386, 364, 234, 249, 231, 357, 327, 268, 279, 269, 389, 381, 412, 221, 372, 367, 380, 341, 230, 282,~
$ solution <chr> "BERTH", "CACAO", "FRAME", "TROVE", "ALOFT", "GOOSE", "SLUNG", "SMELT", "DEPOT", "TEASE", "BLAND", ~
$ date <date> 2022-07-10, 2022-06-18, 2022-02-08, 2022-02-23, 2022-02-05, 2022-06-11, 2022-05-12, 2022-03-14, 20~
$ attempts <dbl> 5, 6, 5, 5, 4, 5, 6, 3, 4, 2, 3, 5, 6, 3, 4, 3, 3, 3, 4, 2, 4, 4, 3, 4, 2, 5, 5, 3, 4, 6, 4, 6, 2, ~
$ score <dbl> 26, 34, 18, 16, 11, 33, 26, 15, 21, 7, 8, 28, 33, 6, 22, 12, 11, 12, 17, 7, 15, 18, 16, 14, 7, 24, ~
$ vowels <int> 1, 3, 2, 2, 2, 3, 1, 1, 2, 3, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 3, 2, 2, 1, 2, 2, ~
$ repeated <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ freq <dbl> -19.21077, -17.90610, -15.74757, -19.01154, -19.23042, -16.46714, -19.97755, -19.25412, -17.31333, ~
$ wday <ord> Sun, Sat, Tue, Wed, Sat, Sat, Thu, Mon, Fri, Tue, Wed, Tue, Fri, Wed, Sun, Tue, Mon, Thu, Fri, Mon,~
$ tweet <chr> "\n\n<U+2B1B><U+2B1B><U+2B1B><U+0001F7E8><U+0001F7E8>\n<U+0001F7E8><U+0001F7E9><U+2B1B><U+0001F7E8><U+2B1B>\n<U+2B1B><U+0001F7E9><U+0001F7E8><U+2B1B><U+0001F7E8>\n<U+0001F7E8><U+0001F7E9><U+0001F7E8><U+2B1B><U+2B1B>\n<U+0001F7E9><U+0001F7E9><U+0001F7E9><U+0001F7E9><U+0001F7E9>", "\n\n<U+2B1B><U+0001F7E8><U+2B1B><U+2B1B><U+2B1B>\n<U+0001F7E9><U+2B1B><U+0001F7E8><U+2B1B><U+2B1B>\n<U+0001F7E9><U+0001F7E9>~
$ tweet_text <chr> "Wordle 386 5/6\n\n<U+2B1B><U+2B1B><U+2B1B><U+0001F7E8><U+0001F7E8>\n<U+0001F7E8><U+0001F7E9><U+2B1B><U+0001F7E8><U+2B1B>\n<U+2B1B><U+0001F7E9><U+0001F7E8><U+2B1B><U+0001F7E8>\n<U+0001F7E8><U+0001F7E9><U+0001F7E8><U+2B1B><U+2B1B>\n<U+0001F7E9><U+0001F7E9><U+0001F7E9><U+0001F7E9><U+0001F7E9>", "Wordle 364 6/6\n\n~
<U+0001F7E8><U+0001F7E9><U+0001F7E9><U+0001F7E8><U+0001F7E8><U+0001F7E8><U+0001F7E9><U+2B1B><U+0001F7E9><U+0001F7E9><U+0001F7E9><U+0001F7E8><U+0001F7E8><U+2B1B><U+0001F7E8><U+2B1B><U+0001F7E8><U+0001F7E9><U+0001F7E8><U+2B1B><U+2B1B><U+0001F7E8><U+0001F7E8><U+2B1B><U+2B1B><U+0001F7E8><U+0001F7E9><U+2B1B><U+0001F7E8><U+0001F7E8><U+0001F7E9><U+2B1B><U+0001F7E8><U+2B1B><U+2B1B><U+0001F7E8><U+0001F7E9><U+2B1B><U+2B1B><U+0001F7E9><U+0001F7E9><U+2B1B><U+0001F7E9><U+0001F7E9><U+0001F7E9><U+0001F7E8><U+0001F7E9><U+0001F7E9><U+0001F7E9>
This question asks you to summarise and explore the
wordle.int dataset, a sample of 20,000 tweets about wordle
from around the world. This kind of summarisation is a critical step in
understanding any dataset.
Demonstrate R commands to generate summary statistics for each wordle
in the wordle.int dataset:

* Group by the wordle-level variables: date, solution, vowels, repeated, freq, and wday
* Compute the count (n), mean, and standard deviation of both the score and the attempts variables
* Save the result as an object called wordle.int.summary and print the first 10 rows of that object

# A tibble: 216 x 12
# Groups: date, solution, vowels, repeated, freq [216]
date solution vowels repeated freq wday score_n score_mean score_sd attempts_n attempts_mean attempts_sd
<date> <chr> <int> <dbl> <dbl> <ord> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2022-01-15 PANIC 2 0 -16.5 Sat 67 18.5 7.39 67 3.78 7.39
2 2022-01-16 SOLAR 2 0 -15.7 Sun 60 17.4 6.83 60 4.03 6.83
3 2022-01-17 SHIRE 2 0 -18.7 Mon 45 13.3 7.29 45 3.44 7.29
4 2022-01-18 PROXY 2 0 -17.6 Tue 50 25.1 8.31 50 4.72 8.31
5 2022-01-19 POINT 2 0 -14.0 Wed 55 17.2 6.01 55 3.55 6.01
6 2022-01-20 ROBOT 2 0 -16.6 Thu 57 20.1 8.82 57 3.91 8.82
7 2022-01-21 PRICK 1 0 -18.6 Fri 145 18.3 7.29 145 3.85 7.29
8 2022-01-22 WINCE 2 0 -20.8 Sat 59 21.8 7.99 59 4.39 7.99
9 2022-01-23 CRIMP 1 0 -19.3 Sun 102 18.7 6.97 102 3.95 6.97
10 2022-01-24 KNOLL 1 1 -19.1 Mon 72 26.3 6.68 72 4.54 6.68
# … with 206 more rows
wordle.int %>%
  group_by(date, wordle_id, solution, vowels, repeated, freq, wday) %>%
  summarise(score_n = sum(!is.na(score)),
            score_mean = mean(score),
            score_sd = sd(score),
            attempts_n = sum(!is.na(attempts)),
            attempts_mean = mean(attempts),
            attempts_sd = sd(attempts)) -> wordle.int.summary
head(wordle.int.summary, 10)
# A tibble: 10 x 13
# Groups: date, wordle_id, solution, vowels, repeated, freq [10]
date wordle_id solution vowels repeated freq wday score_n score_mean score_sd attempts_n attempts_mean
<date> <dbl> <chr> <int> <dbl> <dbl> <ord> <int> <dbl> <dbl> <int> <dbl>
1 2022-01-15 210 PANIC 2 0 -16.5 Sat 67 18.5 7.39 67 3.78
2 2022-01-16 211 SOLAR 2 0 -15.7 Sun 60 17.4 6.83 60 4.03
3 2022-01-17 212 SHIRE 2 0 -18.7 Mon 45 13.3 7.29 45 3.44
4 2022-01-18 213 PROXY 2 0 -17.6 Tue 50 25.1 8.31 50 4.72
5 2022-01-19 214 POINT 2 0 -14.0 Wed 55 17.2 6.01 55 3.55
6 2022-01-20 215 ROBOT 2 0 -16.6 Thu 57 20.1 8.82 57 3.91
7 2022-01-21 216 PRICK 1 0 -18.6 Fri 145 18.3 7.29 145 3.85
8 2022-01-22 217 WINCE 2 0 -20.8 Sat 59 21.8 7.99 59 4.39
9 2022-01-23 218 CRIMP 1 0 -19.3 Sun 102 18.7 6.97 102 3.95
10 2022-01-24 219 KNOLL 1 1 -19.1 Mon 72 26.3 6.68 72 4.54
# ... with 1 more variable: attempts_sd <dbl>
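The repetition inside summarise() above can be avoided with dplyr's across(), which applies a named list of functions to several columns at once. A minimal sketch on toy data (the real call would group wordle.int exactly as above; the toy tibble and its values are made up for illustration):

```r
library(dplyr)

# Toy stand-in for wordle.int with just the columns being summarised
toy <- tibble(solution = c("PANIC", "PANIC", "SOLAR"),
              score    = c(10, 20, 30),
              attempts = c(3, 4, 5))

# across() builds score_n, score_mean, score_sd, attempts_n, ... automatically
toy %>%
  group_by(solution) %>%
  summarise(across(c(score, attempts),
                   list(n    = ~ sum(!is.na(.x)),
                        mean = ~ mean(.x),
                        sd   = ~ sd(.x))))
```

The output columns are named `{column}_{function}`, matching the names used in wordle.int.summary.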
Note: In case you cannot complete Q1.1, we now
provide wordle.int.summary to be used from here on.
wordle.int.summary <- readRDS("../data/wordle.int.summary.RDS")
Demonstrate R commands to display the proportions of wordles in
wordle.int.summary containing different numbers of
vowels, rounded to 3 decimal places, ensuring that any
missing values (NAs) are included.
# Write your answer here
table(wordle.int.summary$vowels, useNA = "ifany")
1 2 3 4
50 136 29 1
# We can use this table to obtain proportions by dividing each value by the total number of wordles in the data set. E.g.
totalVowels = 50 + 136 + 29 + 1 # Values taken from above table function.
oneVowel = round((50/totalVowels), 3)
twoVowels = round((136/totalVowels), 3)
threeVowels = round((29/totalVowels), 3)
fourVowels = round((1/totalVowels), 3)
cat("Proportion of wordles containing one vowel: ", oneVowel, "\nContaining two vowels: ", twoVowels,
"\nContaining three vowels: ", threeVowels, "\nContaining four vowels", fourVowels)
Proportion of wordles containing one vowel: 0.231
Containing two vowels: 0.63
Containing three vowels: 0.134
Containing four vowels 0.005
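The hand-computed proportions above can be produced in one step: prop.table() converts a table of counts into proportions, and round() finishes the job. A sketch on a toy vector constructed to have the same counts as the table above (50, 136, 29, 1):

```r
# Toy stand-in for wordle.int.summary$vowels with counts 50, 136, 29, 1
vowels <- rep(c(1, 2, 3, 4), times = c(50, 136, 29, 1))

# Counts -> proportions, rounded to 3 dp, keeping any NAs
round(prop.table(table(vowels, useNA = "ifany")), 3)
```

This reproduces the proportions 0.231, 0.630, 0.134 and 0.005 reported above.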
Demonstrate R commands to count the number of distinct dates in wordle.int.summary.
# Write your answer here
n_distinct(wordle.int.summary$date)
[1] 216
Demonstrate R commands that add a month variable to
wordle.int.summary so you can answer the following
question.
* Add the month to wordle.int.summary using
lubridate’s month() function.
* wordle.int.summary is a grouped data frame.
* Depending on your approach, you may
have to remove the grouping to answer these questions.
# Write your answer here
wordle.int.summary <- wordle.int.summary %>%
mutate(month = month.abb[lubridate::month(date)])
View(wordle.int.summary) # We manually view the data set to see which word in July 2022 has the highest mean number of attempts
print("Based on the above commands, the hardest word in July 2022 is 'Cinch' on the 26th, with a mean attempt of 4.55...")
[1] "Based on the above commands, the hardest word in July 2022 is 'Cinch' on the 26th, with a mean attempt of 4.55..."
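View() opens an interactive viewer, so the lookup above does not survive knitting. A reproducible alternative is to filter and sort programmatically with dplyr's slice_max(); a sketch on toy rows standing in for wordle.int.summary after the month column is added (the toy values are illustrative only):

```r
library(dplyr)

# Toy stand-in for wordle.int.summary with the `month` column added
toy <- tibble(solution      = c("CINCH", "GLEAN", "TWANG"),
              month         = c("Jul", "Jul", "Jun"),
              attempts_mean = c(4.55, 4.00, 3.73))

# slice_max() keeps the row(s) with the largest attempts_mean in July
toy %>%
  filter(month == "Jul") %>%
  slice_max(attempts_mean, n = 1)
```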
Using wordle.int.summary, demonstrate R commands that
compute and display the following summary statistics, by month, for the
attempts_mean variable: the mean, median, standard deviation,
first quartile (Q1) and third quartile (Q3).
Note We have asked you to summarise the mean number
of attempts of each wordle: in effect, you are providing summaries of a
summary statistic. The results that you will get will be slightly
different than if you were to work with the individual observations
(i.e., the integer number of attempts in wordle.int). We
have done this to avoid overplotting in Q1.6.
# Write your answer here
wordle.int.summary %>%
group_by(month) %>%
summarize(mean = mean(attempts_mean), median = median(attempts_mean), standard_dev = sd(attempts_mean), Q1 = quantile(attempts_mean, 0.25),
Q3 = quantile(attempts_mean, 0.75))
# A tibble: 8 x 6
month mean median standard_dev Q1 Q3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Apr 4.16 4.14 0.356 3.99 4.46
2 Aug 4.28 4.28 0.381 4.08 4.49
3 Feb 4.14 4.16 0.346 3.93 4.33
4 Jan 4.05 4.02 0.331 3.89 4.25
5 Jul 4.19 4.20 0.234 3.99 4.37
6 Jun 4.15 4.09 0.334 3.94 4.28
7 Mar 4.15 4.19 0.311 3.94 4.32
8 May 4.14 4.12 0.314 3.96 4.29
Demonstrate R commands to create violin plots of
attempts_mean for each week day in the
wordle.int.summary data.
# Write your answer here
ggplot(data = wordle.int.summary, aes(x = wday, y = attempts_mean)) +
  geom_violin() +
  geom_boxplot() +
  geom_jitter(height = 0.05, width = 0.1, colour = "blue", alpha = 0.15) +
  labs(title = "International Wordle data",
       subtitle = "2022-01-15 to 2022-08-21") +
  xlab("") +
  scale_y_continuous(name = "daily mean number of attempts",
                     limits = c(1, 6),
                     breaks = 1:6,
                     expand = c(0, 0))
Consider the summaries that you have been calculating. Ask yourself: how confident are you that they are based on correct data?
Consider the following plot and the code that generated it.
ggplot(wordle.int, aes(x=factor(attempts))) +
geom_bar() +
facet_grid(month(date, label = TRUE)~.) +
labs(
title="Monthly summaries of wordle attempts",
subtitle="International wordle data",
x="number of attempts")
Comment on what this graph shows, what it implies for your summaries of the number of attempts, and what you would do in light of this.
Write your answer here as an Rmarkdown blockquote (lines beginning with >):
The monthly numbers of attempts roughly follow a normal distribution, which is consistent with the earlier summaries in which the mean number of attempts was around 4. The plot may also reveal outliers in the data; we could address these by, for example, removing the affected data points or replacing an outlier value with one nearer the median.
The sample of Australian wordles in wordle.aus was
created by selecting tweets whose geolocation showed they originated
within a 40km radius of state and territory capitals.
Demonstrate R commands to compute the mean and standard deviation of
score and attempts by city in the
Australian wordle dataset (wordle.aus).
# Write your answer here
wordle.aus %>%
group_by(city) %>%
summarize(mean_score = mean(score), sd_score = sd(score), mean_attempts = mean(attempts), sd_attempts = sd(attempts))
# A tibble: 8 x 5
city mean_score sd_score mean_attempts sd_attempts
<chr> <dbl> <dbl> <dbl> <dbl>
1 adelaide 19.1 7.51 4.09 1.06
2 brisbane 19.3 7.42 4.11 1.09
3 canberra 19.2 7.39 4.10 1.08
4 darwin 19.0 6.29 3.96 0.903
5 hobart 19.5 7.42 4.04 1.02
6 melbourne 19.5 7.28 4.14 1.07
7 perth 19.4 7.44 4.11 1.10
8 sydney 19.7 7.16 4.15 1.05
Here we provide a more detailed data frame with attempts
and score summarised by city per date.
# A tibble: 240 x 10
# Groups: city, date, freq [240]
city date freq solution score_n score_mean score_sd attampts_n attempts_mean attempts_sd
<chr> <date> <dbl> <chr> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 adelaide NA -22.7 COYLY 25 26.6 5.53 25 5.24 0.831
2 adelaide NA -20.3 HUNKY 29 23.3 6.17 29 4.31 1.07
3 adelaide NA -20.1 ELOPE 31 20 7.14 31 4.16 0.969
4 adelaide NA -20.0 GRUEL 27 23 6.10 27 4.52 0.893
5 adelaide NA -19.9 CINCH 27 22.4 5.78 27 4.48 0.975
6 adelaide NA -19.9 GLEAN 28 17.5 7.32 28 4 1.22
7 adelaide NA -19.8 SHRUG 30 18.2 6.43 30 4 1.05
8 adelaide NA -19.3 TWANG 30 17.3 6.74 30 3.73 0.785
9 adelaide NA -19.3 UNFIT 34 20.4 6.04 34 4.12 1.04
10 adelaide NA -19.3 KHAKI 29 23.7 6.63 29 4.31 0.850
# ... with 230 more rows
Using the data frame wordle.aus.city.date and
ggplot(), create an interactive visualisation of the
relationship between attempts_mean and
score_mean with a scatter plot.
* Map score_n to the size aesthetic
* Use an alpha of 0.5 to deal with overplotting
* Map score_n to the weight aesthetic of the geom_smooth() function
* Use ggplotly in the package plotly to produce an interactive graph.
# Write your answer here
ggplot(data=wordle.aus.city.date, aes(x=attempts_mean, y=score_mean)) +
geom_point(alpha=0.5, aes(colour=factor(city), size=score_n)) +
geom_smooth(method="lm", aes(weight=score_n)) -> Q2.2
ggplotly(Q2.2)
Using wordle.aus.city.date, demonstrate code to:

* extract attempts_mean for Sydney and for Canberra
* perform a t-test comparing the mean attempts of the two cities
# Write your answer here
syd_mean_attempts <- wordle.aus.city.date$attempts_mean[wordle.aus.city.date$city=="sydney"]
canb_mean_attempts <- wordle.aus.city.date$attempts_mean[wordle.aus.city.date$city=="canberra"]
# Write your answer here
t.test(syd_mean_attempts, canb_mean_attempts)
Welch Two Sample t-test
data: syd_mean_attempts and canb_mean_attempts
t = 0.44163, df = 52.241, p-value = 0.6606
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1640305 0.2566177
sample estimates:
mean of x mean of y
4.163191 4.116898
Write your answer here as an Rmarkdown blockquote (lines beginning with >):
The t-test compares two hypotheses: the null hypothesis (no difference between the true means of the two groups) versus the alternative hypothesis (a difference between the true means of the two groups). The p-value of 0.6606 is the probability, assuming the null hypothesis is true, of obtaining a t-statistic whose magnitude is equal to or greater than 0.44163 (the test statistic), with 52.2 degrees of freedom. This p-value means we cannot reject the null hypothesis: there is not enough evidence of a difference between the means of the two groups at the usual significance level of 0.05.
This question explores two different linear regression models of the
wordle.int.summary data to predict
attempts_mean.
Use the following covariates or independent variables to predict the
response (attempts_mean):
* Model 1: freq, vowels and repeated.
* Model 2: freq and vowels.

# Write your answer here
model_1.lm <- lm(attempts_mean ~ freq + vowels + repeated, data = wordle.int.summary)
model_2.lm <- lm(attempts_mean ~ freq + vowels, data = wordle.int.summary)
model_1.lm
Call:
lm(formula = attempts_mean ~ freq + vowels + repeated, data = wordle.int.summary)
Coefficients:
(Intercept) freq vowels repeated
2.9893 -0.0569 0.0701 0.2543
model_2.lm
Call:
lm(formula = attempts_mean ~ freq + vowels, data = wordle.int.summary)
Coefficients:
(Intercept) freq vowels
3.04380 -0.05673 0.05793
# Write your code here
model_1.lm$coefficients
(Intercept) freq vowels repeated
2.98926510 -0.05690313 0.07010316 0.25432926
Write your answer here as an Rmarkdown blockquote (lines beginning
with >):
For every 1-unit increase in frequency, the expected mean attempts decreases by 0.057, holding the other predictors constant. For every additional vowel in a wordle, the expected mean attempts increases by 0.070.
# Write your answer here
AIC(model_1.lm)
[1] 90.86752
AIC(model_2.lm)
[1] 104.3646
Model 1 fits the data better. AIC penalises models that use more parameters, yet Model 1, despite including the extra predictor repeated, has the lower AIC (90.9 versus 104.4): the improvement in fit more than offsets the complexity penalty.
Interpret the apparent effect of repeated characters on
attempts_mean in Model 1.
Write your answer here as an Rmarkdown blockquote (lines beginning with >):
From Q3.2, we can see that the repeated predictor in model 1 has a coefficient of 0.254. This means that for every 1 unit increase in repeated characters, the expected mean attempts increases by 0.254.
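One way to see what this coefficient means: for two words with identical freq and vowels, the predicted attempts_mean differs by exactly the repeated coefficient. A sketch in base R using the Q3.2 coefficients (the covariate values -18 and 2 are arbitrary illustrative choices):

```r
# Coefficients reported in Q3.2 (Model 1)
b <- c(intercept = 2.98926510, freq = -0.05690313,
       vowels = 0.07010316, repeated = 0.25432926)

# Predicted mean attempts for given covariate values
pred <- function(freq, vowels, repeated) {
  unname(b["intercept"] + b["freq"] * freq +
         b["vowels"] * vowels + b["repeated"] * repeated)
}

# Same freq and vowels, repeated letters vs none: the gap is the coefficient
pred(-18, 2, 1) - pred(-18, 2, 0)  # 0.2543...
```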
Demonstrate code to visually assess whether the residuals in Model 1 are normally distributed, and briefly discuss what you conclude and why.
plot.lm plots a range of diagnostics for
lm objects
# Write your answer here
plot(model_1.lm, which = 2) # which = 2 selects the Normal Q-Q diagnostic plot
# See workshop 9.1
Write your answer here as an Rmarkdown blockquote (lines beginning
with >):
The Normal Q-Q plot suggests the residuals are approximately normally distributed: the points follow the normality line closely.
Here, we create an additional variable called difficulty
in the dataframe wordle.int.summary
* If attempts_mean \(\leq 4.1\), then difficulty is low
* If attempts_mean \(> 4.1\), then difficulty is high

wordle.int.summary %>%
ungroup() %>%
mutate(
difficulty = factor(
ifelse(attempts_mean <= 4.1, "low", "high"),
levels=c("low", "high")
)
) -> wordle.int.summary
Next, we randomly (but reproducibly) partition
wordle.int.summary into training and
testing sets:
set.seed(1) # for reproducibility
sample <- sample(nrow(wordle.int.summary),floor(nrow(wordle.int.summary)*0.7))
training <- wordle.int.summary[ sample,]
testing <- wordle.int.summary[-sample,]
Demonstrate R commands to fit a logistic regression model on the training set
using freq, vowels and repeated as
predictors of difficulty.
# Write your answer here
glm(difficulty ~ freq + vowels + repeated, data=training, family="binomial") -> training.glm
training.glm
Call: glm(formula = difficulty ~ freq + vowels + repeated, family = "binomial",
data = training)
Coefficients:
(Intercept) freq vowels repeated
-5.0909 -0.2527 0.4821 1.3258
Degrees of Freedom: 150 Total (i.e. Null); 147 Residual
Null Deviance: 202.9
Residual Deviance: 189.8 AIC: 197.8
Use your fitted model to predict difficulty in the testing set and save
the results to the testing set as a new variable called
prediction.
# Write your answer here
testing$prediction <- training.glm %>% predict(testing, type="response")
testing$prediction
1 2 3 4 5 6 7 8 9 10 11 12
0.6445979 0.5781986 0.3541735 0.5189757 0.7568412 0.5648915 0.4186560 0.4216720 0.4031905 0.6035482 0.4638245 0.6849840
13 14 15 16 17 18 19 20 21 22 23 24
0.8756442 0.2896762 0.7005997 0.6022788 0.6600108 0.3241230 0.6287186 0.4843781 0.6101171 0.7897278 0.5618286 0.4023296
25 26 27 28 29 30 31 32 33 34 35 36
0.6788279 0.8444698 0.7798131 0.5797514 0.5960917 0.5406892 0.6472906 0.2902641 0.6403225 0.6897535 0.6471455 0.4133288
37 38 39 40 41 42 43 44 45 46 47 48
0.5103873 0.8993954 0.5184869 0.7244255 0.6786495 0.7363602 0.7192409 0.4021544 0.3674010 0.6039731 0.5411835 0.8018943
49 50 51 52 53 54 55 56 57 58 59 60
0.4821190 0.7069143 0.4272740 0.6891867 0.8265687 0.7812279 0.4902826 0.5612018 0.7308136 0.6617327 0.5991350 0.8900565
61 62 63 64 65
0.5786064 0.7098222 0.4774360 0.4794829 0.4370228
Use testing$prediction to create
a new variable called testing$predicted.difficulty whose
value is “low” if testing$prediction \(< 0.5\) and “high” otherwise.
* Print out the first ten values of testing$predicted.difficulty
# Write your answer here
testing %>%
ungroup() %>%
mutate(
predicted.difficulty = factor(
ifelse(prediction < 0.5, "low", "high"),
levels=c("low", "high")
)
) -> testing
head(testing$predicted.difficulty, 10)
1 2 3 4 5 6 7 8 9 10
high high low high high high low low low high
Levels: low high
Use table() to produce a confusion matrix of predicted against actual difficulty, like:
          actual
predicted low high
low 13 6
high 21 25
# Write your answer here
confusion_matrix <- table(testing$predicted.difficulty, testing$difficulty)
names(dimnames(confusion_matrix)) <- c("predicted", "actual")
confusion_matrix
actual
predicted low high
low 13 6
high 21 25
Compute:
* TPR, the true positive rate of your model on the test set
* FPR, the false positive rate of your model on the test set
# Write your answer here
TP = 25
FP = 21
FN = 6
TN = 13
P = TP + FN
N = TN + FP
TPR = TP/P
FPR = FP/N
TPR
[1] 0.8064516
FPR
[1] 0.6176471
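Hard-coding the four cell counts is error-prone; the same rates can be read straight off the confusion matrix. A sketch in base R, rebuilding the matrix shown above with “high” as the positive class:

```r
# Confusion matrix from above: rows = predicted, columns = actual
cm <- matrix(c(13, 21, 6, 25), nrow = 2,
             dimnames = list(predicted = c("low", "high"),
                             actual    = c("low", "high")))

TPR <- cm["high", "high"] / sum(cm[, "high"])  # TP / (TP + FN) = 25/31
FPR <- cm["high", "low"]  / sum(cm[, "low"])   # FP / (FP + TN) = 21/34
c(TPR = TPR, FPR = FPR)
```

This reproduces the TPR of 0.806 and FPR of 0.618 computed above.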
Use the simple.roc()
function (below) to plot the Receiver Operating Characteristic curve of
your model’s predictions on the test set.
# Modified from https://blog.revolutionanalytics.com/2016/08/roc-curves-in-two-lines-of-code.html
# labels is a logical vector indicating whether an example belongs to the positive class
# scores is the predicted probability of that example belonging to the positive class
simple.roc <- function(labels, scores){
ordered.scores <- order(scores, decreasing=TRUE)
labels <- labels[ordered.scores]
tibble(
TPR=c(0,cumsum( labels))/sum( labels),
FPR=c(0,cumsum(!labels))/sum(!labels),
labels=c(NA,labels),
score=c(Inf, scores[ordered.scores])
)
}
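A quick toy check of simple.roc() (assuming the function above and the tibble package are loaded; the labels and scores below are made up): when the scores separate the classes perfectly, TPR reaches 1 before FPR leaves 0.

```r
# Perfectly separating scores: both positives outrank both negatives
roc <- simple.roc(labels = c(TRUE, TRUE, FALSE, FALSE),
                  scores = c(0.9, 0.8, 0.3, 0.1))
roc$TPR  # 0.0 0.5 1.0 1.0 1.0
roc$FPR  # 0.0 0.0 0.0 0.5 1.0
```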
# Write your answer here
difficulty.roc <- simple.roc(testing$difficulty == "high", testing$prediction)
diagonal <- data.frame(FPR = c(0, 1), TPR = c(0, 1))
ggplot(data = difficulty.roc, aes(x = FPR, y = TPR)) +
  geom_line() +
  geom_line(data = diagonal, lty = 2) +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(x = "False Positive Rate (FPR)", y = "True Positive Rate (TPR)")
# Uncomment to terminate knitting
#knitr::knit_exit()
Choose only one of the following two questions to answer.
As a guide, each of your answers should be around 5 lines of text.
Word Tips published some interesting Wordle results comparing the Wordle solving skills of different cities worldwide.
Multiple users have tweeted wordle results with a perfect score, i.e., they have correctly guessed the word in one attempt. This seems quite unlikely to happen.
Write your answer here as an Rmarkdown blockquote (lines beginning with >):
I believe there is a balanced amount of ethically significant benefit and harm in cheating at games such as Wordle. Generally, playing Wordle has no external effect on the people around you, regardless of whether you play honestly or dishonestly, and neither completing nor failing to complete your Wordle brings any significant benefit to an individual’s life. One could argue, though, that the ethical practice of following rules and specifications is necessary, and that wilfully ignoring them is an ethically significant harm. Furthermore, cheating introduces inaccuracies and outliers into the data, which may distort our understanding of it and reduce fairness and justice.
Write your answer here as an Rmarkdown blockquote (lines beginning with >):
I would not cheat at Wordle. The ethical framework presented by Shannon Vallor in Introduction to Data Ethics that informs my reasoning is deontological ethics, i.e. rule- or principle-based systems of ethics. The rules of Wordle do not permit us to cheat: you are not allowed to know the word, or the potential letters, before you start playing your daily Wordle, and you are expected to play the game truthfully, with natural, human mistakes arising. This applies to games generally: you must not give yourself an unfair advantage over other players, as doing so breaches morals, truthfulness and the rules of the game.
Write your answer here as an Rmarkdown blockquote (lines beginning with >):
There are several scenarios in which people may be tempted to cheat or misrepresent the truth. One major recent example of misrepresenting the truth is people claiming that various elections have been ‘rigged’, or that votes have been miscounted in favour of a candidate who does not match their political compass. Some news outlets have enabled this, generally favouring one political party over another, and have misrepresented data in their broadcasts to imply that one candidate is more popular than the other. This misleads the public and may influence their votes not only in a current election, but also in future elections.
R for data science (2 marks total)
* Knit your completed Assessment02.Rmd
* Check that this produces Assessment02.html in your working directory
* Submit a zip file named with your student number (e.g. n12345678.zip): a zipped folder n12345678 which contains the files Assessment02.Rmd and Assessment02.html